library(roahd)
Functional Data Analysis is one of the core research areas of our department, you have already seen during class that some of the state-of-the-art techniques were actually developed here. Depth-based techniques find a natural application in Functional Statistics. I assume that many of you already know what we mean by functional data (applied statistics course docet), nevertheless do not worry if you do not: I have structured this lab as a very basic primer to understand how to do some methodological and applied work in the framework of functional data analysis (FDA).
According to the FDA model, data can be seen as measurements of a quantity/quantities along a given, independent and continuous indexing variable (such as time or space). Observations are then treated as random functions and can be viewed as trajectories of stochastic processes defined on a given infinite dimensional functional space.
A natural way to reason (and actually comprehend why they need specific and peculiar techniques) is to think about them as relatively dense longitudinal data: unlike tabular data, functional data are not invariant to permutations of their dimensions (i.e., if I switch two columns of a standard multivariate dataset no harm is done, if I do it with a functional one, I will create a bloody mess).
Given its ubiquitous presence in applications FDA has been a very
active research field in the past decades, with a plethora of R packages
developed for helping practitioner efficiently and effectively employ
such techniques. In what follows we will focus our attention on the
roahd (RObust Analysis of High Dimensional Data) package
(developed by members and former members of our Department of
Mathematics): a package meant to collect and provide methods
for the analysis of univariate and multivariate functional datasets
through the use of robust methods. First off we will dedicate some time
in understanding how to represent these complex infinite
dimensional objects (our computers only deal with finite approximations)
and how to simulate functional data . We will then focus on
computation of depths and outlier detection. This lab WILL NOT provide a
thorough treatment of FDA, but we will limit to study some introductory
concepts on how to deal with univariate (and bivariate, only very
briefly) functional datasets.
The way roahd package represents functional objects is
by providing an evenly spaced grid \(I=\left[t_{0}, t_{1}, \ldots,
t_{P-1}\right]\) \(\left(t_{j}-t_{j-1}=h>0, \: \forall j=1,
\ldots, P-1\right)\) over which the functional observations \(D_{i, j}=X_{i}\left(t_{j}\right)\)
(i.e. the \(i\)th function
evaluated at the \(j\)th point of the
grid) \(\forall i=1, \ldots, N\) and
\(\forall j=0, \ldots, P-1\) are
measured. This is very conveniently handled by the fData
object class. In particular, the following model is considered for the
generation of data: \[
X(t)=m(t)+\varepsilon(t), \text { for all } t \text { in } I
\] where \(m(t)\) is the mean
function (deterministic) and \(\varepsilon(t)\) is a centered Gaussian
process with covariance function \(C(\cdot,\cdot)\). That is to say: \[
\operatorname{Cov}(\varepsilon(\mathrm{s}), \varepsilon(t))=C(s, t),
\text { with } s, t \text { in } I
\] The employed structure for \(C(s,
t)\) is the Exponential covariance function:
\[ C(s, t)=\alpha e^{-\beta|s-t|} \] Where \(alpha\) controls the point-wise variance, and \(\beta\) the degree of covariance between points.
P <- 101
grid <- seq( 0, 1, length.out = P)
alpha <- 0.2
beta <- 0.2
C_st <- exp_cov_function( grid, alpha, beta )
dim(C_st) # note of course it is a covariance matrix
## [1] 101 101
C_st contains a \(P \times
P\) matrix of values.
image( C_st,
main = 'Exponential covariance function',
xlab = 'grid', ylab = 'grid')
After having defined the mean function \(m(t)\)
m <- sin(pi*grid)+sin(2*pi*grid)
we are ready to generate functional data as follows
n <- 100
set.seed(26111992)
data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)
dim(data) # 100 functions evaluated at 101 points each
## [1] 100 101
The output of the previous chunk is actually a \(n \times P\) matrix, where \(P\) is equal to the length of the grid. That is, instead of having a “proper” functional datum, I have its evaluation on a relatively fine grid (this is actually the best we can do). We can plot it with:
matplot(grid,t(data), type="l", col=adjustcolor(col=1,alpha.f = .4))
lines(grid,m, col="blue", lwd=5) # add something else on top of existing graph
Or even better we exploit the features of the roahd
package to construct a functional object:
f_data <- fData(grid, data)
plot(f_data) # what happens if I do plot(data)?
lines(grid,m, col="black", lwd=5)
The first command defines an object of class fdata
class(f_data)
## [1] "fData"
with dedicated methods for plots, the four basic algebraic
operations, subsetting and much more. In particular, using objects of
class fdata makes the computation of depth measures and
related quantities much easier (see next section). For a thorough
account on the potential of the roahd package, check this
paper.
By changing the hyperparameters \(\alpha\) and \(\beta\) in the Exponential covariance function, we can generate functional datum with different degree of dependence and variability. In details, \(\alpha\) controls the overall level of variability in the signal, while the parameter \(\beta\) affects the autocorrelation length of the signal’s noise, with lower values of \(\beta\) leading to wider correlation lengths and vice-versa
alpha <- 1
beta <- 0.2
C_st <- exp_cov_function( grid, alpha, beta )
data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)
f_data <- fData(grid,data)
plot(f_data, main="High overall level of variability")
alpha <- .1
beta <- 0.0001
C_st <- exp_cov_function( grid, alpha, beta )
data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)
f_data <- fData(grid,data)
plot(f_data, main="High smothness")
alpha <- .1
beta <- 100
C_st <- exp_cov_function( grid, alpha, beta )
data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)
f_data <- fData(grid,data)
plot(f_data, main="Virtually uncorrelated signals")
Let us now consider the data you have seen in the Case Study: ECG
signals. The registered and smoothed signals are contained in the
mfD_healthy object.
data("mfD_healthy") # included in roahd
univariate_fdata <- mfD_healthy$fDList[[1]] # I consider the first lead only
print(class(univariate_fdata)) #fData object
## [1] "fData"
plot(univariate_fdata) # can do it directly with an fData object
With roadhd it is very easy to compute Band depths and
Modified band depths for a given fdata object:
band_depth <- BD(Data = univariate_fdata)
modified_band_depth <- MBD(Data = univariate_fdata)
We can compute the median curve
median_curve <- median_fData(fData = univariate_fdata, type = "MBD") # still an fData object
Or manually by firstly computing the Modified Band Depth and then identifying the curve with max MBD
median_curve_manual <- univariate_fdata[which.max(modified_band_depth),] # still an fData object
all(median_curve_manual$values==median_curve$values)
## [1] TRUE
plot(univariate_fdata)
grid_ecg <- seq(median_curve_manual$t0,median_curve_manual$tP,by=median_curve_manual$h)
lines(grid_ecg,median_curve_manual$values)
What if we used the Tukey Median for a functional datum? Let’s try it:
set.seed(2)
n = 32
# generate functional datset
grid = seq(0,1,length.out=1000)
alpha <- 1
beta <- 0.2
C_st <- exp_cov_function( grid, alpha, beta )
m <- sin(pi*grid)+sin(2*pi*grid)
data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)
f_data = fData(grid=grid, values=data) # cast to fData
plot(f_data)
median.mbd = median_fData(fData = f_data, type = "MBD")
lines(grid, median.mbd$values, col="black", lwd=5)
library(DepthProc)
## Caricamento del pacchetto richiesto: ggplot2
## Caricamento del pacchetto richiesto: Rcpp
## Caricamento del pacchetto richiesto: rrcov
## Caricamento del pacchetto richiesto: robustbase
## Scalable Robust Estimators with High Breakdown Point (version 1.7-4)
## Caricamento del pacchetto richiesto: MASS
## Caricamento del pacchetto richiesto: np
## Nonparametric Kernel Methods for Mixed Datatypes (version 0.60-17)
## [vignette("np_faq",package="np") provides answers to frequently asked questions]
## [vignette("np",package="np") an overview]
## [vignette("entropy_np",package="np") an overview of entropy-based methods]
##
## Caricamento pacchetto: 'DepthProc'
## Il seguente oggetto è mascherato da 'package:base':
##
## as.matrix
tukey.depth=depth(u=f_data$values,method='Tukey')
tukey.deepest.idx = which(tukey.depth==max(tukey.depth))
lines(grid, f_data$values[tukey.deepest.idx[1],], col="red", lwd=5)
It does not seem like a correct estimator for location… This is due to the fact that as \(d\) increases, heuristic approaches are utilised for classical depth measures (such as Tukey depth), yielding poor estimation. Remember, in the multivariate setting, depth measures deal with the “center-outward order”, whereas in FDA we try to obtain a “down-upward order”.
Epigraph Index (EI) and Hypograph Index (HI) or their corresponding
Modified versions (MEI and MHI) for providing down-upward/up-downward
order of data can also be computed: see functions EI,
HI, MEI, MHI.
We recall the definition of the (sample) Modified Epigraph Index (MEI): \[ MEI(f_i) = N^{-1} \sum_{i\neq j} \lambda ( \{t \in \mathcal{I}: f_j(t) \geq f_i(t) \}) \] that is: for the \(i\)th function, the mean, with respect to all other functions \(j\) of the lebesgue measure, ie. total time or total length of subset of domain, such that the \(j\)th function has a higher value than the \(i\) one.
For the Hypograph index, we just change the sign: we count the mean “time” other functions have a lower value than the \(i\)th one.
The sample \(L_{n}\operatorname{grade}(x_i)\) for the
\(i\)th corresponds to the
MEI, except that the divisor of the sum also takes into
account the length of the interval over which we measure the functions.
Same goes for \(SL_{n}\operatorname{grade}(x_i)\).
As an example, let us use them to compute the Spearman’s correlation index between the first and second lead of the ECG signals. Recall that for a bivariate functional dataset \([\mathbf{x}, \mathbf{y}]\), the estimate for Spearman’s correlation index \(\hat{\rho}_{s}(\mathbf{x}, \mathbf{y})\) is given
\[ \hat{\rho}_{s}(\mathbf{x}, \mathbf{y})=\hat{\rho}_{p}\left(I L\operatorname{grade}_{n}(\mathbf{x}), I L\operatorname{grade}_{n}(\mathbf{y})\right) \] where \(\hat{\rho}_{p}\) is the sample Pearson correlation coefficient and \(I L\operatorname{grade}_{n}\) is the Inferior-Length w.r.t the sample and \[ \begin{aligned} &I L_{n}\operatorname{grade}(\mathbf{x})=\left(I L_{n}\operatorname{grade}\left(x_{1}\right), I L_{n}\operatorname{grade}\left(x_{2}\right), \ldots, I L_{n}\operatorname{grade}\left(x_{n}\right)\right) \\ &I L_{n}\operatorname{grade}(\mathbf{y})=\left(I L_{n}\operatorname{grade}\left(y_{1}\right), I L_{n}\operatorname{grade}\left(y_{2}\right), \ldots, I L_{n}\operatorname{grade}\left(y_{n}\right)\right) . \end{aligned} \] where the inferior length grade \(I L\operatorname{grade}_{n}(x_i)\) can be interpreted as the “proportion of time” that the sample \(x_1,\ldots,x_n\) is smaller than \(x_i\), that is it defines the relative position of a curve with respect to the sample (it pretty much counts what percentage of the domain curve \(i\) is higher than other samples)
The Spearman’s correlation index is immediately obtained via the
cor_spearman function
bivariate_data <- as.mfData(list(mfD_healthy$fDList[[1]], mfD_healthy$fDList[[2]]))
plot(bivariate_data)
cor_spearman(bivariate_data, ordering='MHI')
## [1] 0.6811366
Or actually we can manually compute it:
MHI_first_lead <- MHI(bivariate_data$fDList[[1]]) # modified hypograph
MHI_second_lead <- MHI(bivariate_data$fDList[[2]])
cor(MHI_first_lead, MHI_second_lead)
## [1] 0.6811366
Coding tip: never forget the power of functional programming!
do.call(args = lapply(1:2, function(ind)
MHI(bivariate_data$fDList[[ind]])) # list of arguments
, what = "cor" # name of the function
)
## [1] 0.6811366
We conclude this lab by looking at some graphical tools for identyfing outliers in a functional sample. Consider the following univariate functional data
n = 100
alpha <- 0.2
beta <- 0.002
C_st <- exp_cov_function( grid, alpha, beta )
data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)
f_data <- fData(grid,data)
We consider two cases: first a dataset with some magnitude outliers, obtained inflating the last 10 curves by a number generated from a uniform distribution in \([2,3]\)
set.seed(33) # reproducibility
outlier_share <- .1
n_outliers <- n*outlier_share
out_highlighter <- rep(c(1,2),c(n-n_outliers,n_outliers))
f_data_temp <- f_data[1:(n*(1-outlier_share)) ,] # Coding tip: subsetting is mabe possible by the S3 class fdata
mag_temp <- f_data[(n*(1-outlier_share)+1):n,] * runif(10,2,3)
f_data_mag <- append_fData(f_data_temp,mag_temp)
plot(f_data_mag, col=out_highlighter)
And then some shape outliers by shifting the generating mechanism by
the quantity shift_q:
shift_q <- .5
mu_warp=mu=sin(pi*grid+shift_q)+sin(2*pi*grid+shift_q)
shape_temp=generate_gauss_fdata(N = n_outliers, mu_warp, Cov=C_st)
shape_temp=fData(grid,shape_temp)
f_data_shape=append_fData(f_data_temp,shape_temp)
plot(f_data_shape, col=out_highlighter)
Now let us see whether we can employ the two plotting tools we have seen in class, namely functional boxplots and outliegrams to detect magnitude and shape outliers, respectively.
invisible(fbplot(f_data_mag, main="Magnitude outliers"))
Recall:
I used invisible() in the previous chunk to avoid the
output of the call to be printed on the console (and in the knitted html
file). If I do not use it I obtain the following:
#fbplot(f_data_shape, main="Shape outliers")
Now, we plot the outliergram.
invisible(outliergram(f_data_shape))
We appreciate how the shape outliers are almost entirely missed by the functional boxplot, whereas the outliergram effectively uncovers them. Recall the important result:
\[ MBD(f) \leq a_0 + a_1 MEI(f) + a_2 N^2 MEI^2(f) \] where MBD represents the Modified Band Depth and MEI the Modified Epygraph Index.
As usual, saving the output of the plots to an object allows to recover the ID of the identified outliers:
out_shape <- outliergram(f_data_shape, display = FALSE)
out_shape$ID_outliers
## [1] 91 92 93 94 95 96 97 98 99 100
The outliergram is actually based on some simple manipulations of MEI and MBD, that can be hard coded:
MEI_out_shape <- MEI(f_data_shape)
MBD_out_shape <- MBD(f_data_shape)
a_0 <- a_2 <- -2/(f_data_shape$N*(f_data_shape$N-1))
a_1 <- 2*(f_data_shape$N+1)/(f_data_shape$N-1)
d_manual <- a_0+a_1*MEI_out_shape+a_2*f_data_shape$N^2*MEI_out_shape^2-MBD_out_shape
critical.quantile = quantile(d_manual,probs = .75)
inflation =out_shape$Fvalue
inter.q.range = IQR(x = d_manual)
ID_outliers_manual <- which(d_manual > critical.quantile + inflation *inter.q.range )
Let us check that our manual computations are correct:
all(dplyr::near(d_manual, out_shape$d))
## [1] TRUE
all(ID_outliers_manual==out_shape$ID_outliers)
## [1] TRUE
One last comment: we have seen in class that the value \(F\) used to build the fences in the
functional boxplot can be adjusted. Despite the algorithmic
implementation/ mathematical theory behind it being quite tedious, the
automated selection of \(F\) via the
roahd package is actually pretty easy:
set.seed(22)
fbplot(f_data_mag, main="Magnitude outliers",adjust = F)
## $Depth
## [1] 0.43266061 0.44557414 0.14673778 0.23181859 0.27697455 0.39373939
## [7] 0.37845980 0.24007152 0.32229091 0.51057212 0.36747838 0.16524040
## [13] 0.50228525 0.32346061 0.18765010 0.47359798 0.28053818 0.46371354
## [19] 0.05319111 0.51289253 0.38057091 0.38679071 0.49317657 0.19332404
## [25] 0.41456040 0.41412768 0.49339838 0.41633576 0.34986222 0.24021657
## [31] 0.49830626 0.50972606 0.10227394 0.51254788 0.40362545 0.47396566
## [37] 0.48527192 0.51431515 0.45584525 0.49268242 0.27655596 0.51354828
## [43] 0.36665374 0.32214141 0.49133051 0.39458626 0.13835919 0.50617293
## [49] 0.34231394 0.30308364 0.48971960 0.30998545 0.11199232 0.20417293
## [55] 0.36579192 0.26277293 0.41747596 0.49104929 0.46184444 0.16736768
## [61] 0.33736687 0.51357697 0.46712485 0.20010182 0.47614222 0.43161293
## [67] 0.50855758 0.07877455 0.47681414 0.32361333 0.10765657 0.46227111
## [73] 0.28913414 0.04148000 0.44161899 0.49318505 0.51117091 0.51179434
## [79] 0.44490626 0.51419515 0.49036525 0.43991354 0.25115556 0.50871717
## [85] 0.50398263 0.22841657 0.15565293 0.33634263 0.43330788 0.22293818
## [91] 0.06640808 0.02221737 0.04468525 0.19820444 0.19148566 0.23793455
## [97] 0.16695798 0.22435515 0.22609455 0.20964444
##
## $Fvalue
## [1] 1.5
##
## $ID_outliers
## [1] 71 91 92 93 94 95 96 97 98 99 100
outliergram(f_data_mag,adjust = F)
## $Fvalue
## [1] 1.5
##
## $d
## [1] 6.806010e-04 8.885735e-04 1.127495e-03 4.222739e-04 1.098482e-03
## [6] 2.947677e-04 3.419747e-04 3.475376e-04 6.537598e-04 4.399497e-04
## [11] 9.680089e-04 4.371941e-04 6.715473e-04 1.124969e-03 3.084394e-04
## [16] 6.551911e-04 3.353840e-04 1.115539e-03 1.284798e-04 2.108897e-04
## [21] 8.721325e-04 1.582768e-03 7.557689e-04 7.828232e-04 1.627989e-03
## [26] 2.835160e-04 6.863384e-04 5.127749e-04 1.023838e-03 6.640614e-04
## [31] 5.331038e-04 8.907143e-04 3.741331e-04 7.272558e-04 1.011489e-03
## [36] 1.545786e-03 1.936275e-03 5.631968e-04 8.601679e-04 1.278783e-03
## [41] 1.544740e-03 6.241261e-04 1.781773e-03 1.947660e-03 3.235354e-04
## [46] 5.231675e-04 4.076717e-04 1.398364e-03 1.683872e-03 4.323182e-04
## [51] 9.824063e-04 9.548436e-04 2.376160e-04 3.155505e-04 2.615335e-04
## [56] 4.203990e-04 8.011796e-04 4.132453e-04 1.999279e-03 1.126823e-03
## [61] 2.362244e-04 3.893648e-04 2.560386e-04 1.544756e-03 2.064194e-04
## [66] 9.710699e-04 1.063515e-03 2.523182e-04 3.516717e-04 4.633204e-04
## [71] 2.061798e-03 3.111790e-04 9.722778e-04 3.146659e-04 3.749293e-04
## [76] 1.770198e-03 3.815836e-04 1.979132e-03 4.656111e-04 7.241568e-04
## [81] 2.192000e-03 1.256224e-03 1.123668e-03 4.006432e-04 3.341446e-04
## [86] 8.841947e-04 2.120063e-04 5.653574e-04 1.529212e-03 1.786094e-03
## [91] 4.759596e-05 2.009212e-05 3.808000e-05 5.916414e-02 2.988338e-01
## [96] 7.055863e-02 2.644220e-01 3.276808e-02 1.815623e-01 1.279333e-02
##
## $ID_outliers
## [1] 94 95 96 97 98 99 100